[57] ABSTRACT

Disclosed is a method and apparatus for reconstructing data
in a computer system employing a modified RAID 5 data
protection scheme. The computer system includes a write
back cache composed of non-volatile memory for storing (1)
writes outstanding to a device and associated data read, and
(2) storing metadata information in the non-volatile
memory. The metadata includes a first field containing the
logical block number or address (LBN or LBA) of the data,
a second field containing the device ID, and a third field
containing the block status. From the metadata information
it is determined where the write was intended when the crash
occurred. An examination is made to determine whether
parity is consistent across the slice, and if not, the data in the
non-volatile write back cache is used to reconstruct the write
that was occurring when the crash occurred to insure
consistent parity, so that only those blocks affected by the crash
have to be reconstructed.

19 Claims, 7 Drawing Sheets 
ENHANCED RAID WRITE HOLE
PROTECTION AND RECOVERY

CROSS-REFERENCES TO RELATED
APPLICATIONS

This application is related to co-pending Application Ser.
No. 08/542,827 "A RAID ARRAY DATA STORAGE SYSTEM
WITH STORAGE DEVICE METADATA AND RAID
SET METADATA," assigned to the assignee of this
invention and filed on even date herewith.

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention relates to RAID (Redundant Arrays
of Independent Disks) architecture and systems and more
particularly relates to methods and apparatus for enhancing
RAID write hole protection and recovery from system
failures which would normally leave a "write hole".

2. Description of Related Art

The last twenty years have witnessed a revolution in
information technology. Information handling and processing
problems once thought to be insoluble have been
solved, and the solutions have become part of our daily lives.
On-line credit card processing, rapid and reliable travel
reservations, completely automated factories, vastly
improved weather forecasting through satellites, world wide
Internet communications and many other breakthroughs all
represent formidable computing challenges, yet all have
become commonplace in our lives.

These challenges have been met and mastered in large 
part because of the rate of technological progress in the 
components that comprise computer systems: 

Computers themselves have decreased in size from huge
racks of equipment to units that can be placed on or beside
a desk. While they have been shrinking physically, their
capability to deliver cost effective high-performance
computing has been doubling about every three years.

Graphical capabilities have progressed from the simple
fixed-font character cell display to high-resolution
multi-color displays, with advanced features such as three
dimensional hardware assist becoming increasingly common.

Networks have proliferated. Low-cost, easily accessible
communications capacity has grown from thousands of bits
per second to tens of millions of bits per second, with
billions of bits per second capability starting to appear on
the horizon.

The shrinking computer has combined with the local area
network to create client-server computing, in which a small
number of very powerful server computers provide storage,
backup, printing, wide area network access, and other
services for a large number of desktop client computers.

The capabilities of the bulk data storage for these
machines are equally impressive. A billion bytes of magnetic
storage, that thirty years ago required as much space
and electrical power as a room full of refrigerators, can now
be easily held in one's hand.

With these capabilities has come a dependence upon the
reliable functioning of computer systems. Computer systems
consisting of components from many sources installed at
many locations are routinely expected to integrate and work
flawlessly as a unit.

Redundant Arrays of Independent Disks, or RAID
technology, is sweeping the mass storage industry. Informed
estimates place its expected usage rate at 40% or more of all
storage over the next few years.



There are numerous RAID techniques. They are briefly 
outlined below. A more thorough and complete understand- 
ing may be had by referring to "The RAIDbook, A Source 
Book for Disk Array Technology" the fourth edition of 
which was published by the RAID Advisory Board 
(RAB™), St. Peter, Minn. 

The two most popular RAID techniques employ either a
mirrored array of disks or a striped data array of disks. A
RAID that is mirrored presents very reliable virtual disks
whose aggregate capacity is equal to that of the smallest of
its member disks and whose performance is usually
measurably better than that of a single member disk for reads
and slightly lower for writes.

A striped array presents virtual disks whose aggregate
capacity is approximately the sum of the capacities of its
members, and whose read and write performance are both
very high. The data reliability of a striped array's virtual
disks, however, is less than that of the least reliable member
disk.

Disk arrays may enhance some or all of three desirable 
storage properties compared to individual disks: 

They may improve I/O performance by balancing the I/O
load evenly across the disks. Striped arrays have this
property, because they cause streams of either sequential
or random I/O requests to be divided approximately
evenly across the disks in the set. In many cases, a
mirrored array can also improve read performance
because each of its members can process a separate
read request simultaneously, thereby reducing the
average read queue length in a busy system.

They may improve data reliability by replicating data so
that it is not destroyed or inaccessible if the disk on which
it is stored fails. Mirrored arrays have this property,
because they cause every block of data to be replicated
on all members of the set. Striped arrays, on the other
hand, do not, because as a practical matter, the failure of
one disk in a striped array renders all the data stored on
the array's virtual disks inaccessible.

They may simplify storage management by treating more
storage capacity as a single manageable entity. A system
manager who is managing arrays of four disks (each
array presenting a single virtual disk) has one fourth as
many directories to create, one fourth as many user disk
space quotas to set, one fourth as many backup operations
to schedule, etc. Striped arrays have this property,
while mirrored arrays generally do not.

With respect to classification (sometimes referred to as 
levels), some RAID levels are classified by the RAID 
Advisory Board (RAB™) as follows: 

Very briefly, RAID 0 employs striping, that is, distributing
data across the multiple disks of an array of disks.
No redundancy of information is provided, but
data transfer capacity and maximum I/O rates are very
high.

In RAID level 1, data redundancy is obtained by storing
exact copies on mirrored pairs of drives. RAID 1 uses
twice as many drives as RAID 0 and, compared to a single
disk, has a better data transfer rate for reads but about
the same rate for writes.

In RAID 2, data is striped at the bit level. Multiple error
correcting disks (data protected by a Hamming code)
provide redundancy and a high data transfer capacity for
both read and write, but because multiple additional
disk drives are necessary for implementation, RAID 2 is
not a commercially implemented RAID level.

In RAID level 3, each data sector is subdivided and the
data is striped, usually at the byte level, across the disk
drives, and one drive is set aside for parity information.
Redundant information is stored on a dedicated parity
disk. Very high data transfer, read/write I/O.

In RAID level 4, data is striped in blocks, and one drive
is set aside for parity information.

In RAID 5, data and parity information is striped in
blocks and is rotated among all drives on the array.

Because RAID 5 is the RAID level of choice, it shall be
used in the following discussion. However, much of what
follows is applicable to other RAID levels, including RAID
levels 6 et seq., not discussed above. The invention is
particularly applicable to RAID levels employing a parity or
ECC form of data redundancy and/or recovery.

RAID 5 uses a technique (1) that writes a block of data
across several disks (i.e. striping), (2) calculates an error
correction code (ECC, i.e. parity) at the bit level from this
data and stores the code on another disk, and (3) in the event
of a single disk failure, uses the data on the working drives
and the calculated code to "interpolate" what the missing
data should be (i.e. rebuilds or reconstructs the missing data
from the existing data and the calculated parity). A RAID 5
array "rotates" data and parity among all the drives on the
array, in contrast with RAID 3 or 4 which stores all
calculated parity values on one particular drive. The following is
a simplified example of how RAID 5 calculates ECCs (error
correction codes, commonly referred to as parity), and
restores data if a drive fails.

By way of example, and referring to the prior art drawing
of FIG. 1, assume a five-drive disk array or RAID set on
which it is intended to store four values, e.g. the decimal
numbers 150, 112, 230, and 139. In this example, the
decimal number 150 is binary 10010110 and is to be written
on disk 1. The number 112 as binary number 01110000 is to be
written on disk 2, the number 230 or binary number
11100110 on disk 3 and the number 139 as binary number
10001011 on disk 4. When the four values are written to
disks 1-4, the RAID controller examines the sum of each bit
position, in what is called a slice. If the sum of the bit
position is an odd number then the odd number 1 is assigned
as the parity number; if the sum is an even number, then that
is designated an even number, "0". (It should be noted that
if a reconstruct algorithm (RW) is employed, the parity may
be calculated prior to any writes and that writes are done
essentially in parallel and simultaneously. Thus when the
calculation for parity is accomplished is primarily a function
of the choice of algorithm as well as its precise
implementation in firmware.) Another way in which parity is
determined, as will be more fully exemplified below, is to
exclusive OR [Xor] the first two consecutive bits of a slice,
Xor the result with the next bit and so on, the Xor with the
final bit of the last data-carrying drive being the parity.
However, it should be recognized that the process of
determining parity is commutative, that is the order of Xoring is
unimportant.

Assume disk 2 fails, in the example of FIG. 1. In that
event, the following occurs. The RAID disk controller no
longer ascertains that the value of bit 7 is a 0 on disk 2.
However, the controller knows that its value can be only a
0 or a 1. Inasmuch as disks 1, 3, 4 & 5 are still operating, the
controller can perform the following calculations:
1+?+1+1 = an odd number, or 1. Since 1+0+1+1 = an odd number,
then the missing value on disk 2 must be a 0. The RAID
controller then performs the same calculation for the
remaining bit positions on disk 2. In this way data missing due to
a drive failure may be rebuilt. Another, and often more
convenient way of determining the missing value, i.e. a 0 or
a 1, is by Xoring the parity with the data bit on drive #1,
Xoring the result with the data bit on drive #3 and then
Xoring that result with the data bit on drive #4. The result
is the missing data bit which is attributable to that missing
bit in slice 1 on drive 2. This activity continues for each bit
of the block. Once again, inasmuch as the process of
determining parity is commutative, that is the order of
Xoring is unimportant, the order of making the
determination is unimportant.

This reconstruction is accomplished by the RAID
controller, in conjunction with array management software,
which examines the sum of each BIT position to assign an
even or an ODD number to disk 5. If a disk fails, a 0 or a 1
is assigned to the missing value and a simple calculation is
performed. The missing bit is the Xor of the members
including parity. This process is repeated, and the data is
rebuilt. If a disk drive (#2 in the example) has failed, and
information on that disk is called for by the user, the data
will be built on the fly and placed into memory until a
replacement drive may be obtained. In this manner, no data
is lost. By way of definition, "consistent parity" is the parity
as recorded on the media which is the Xor of all the data bits
as recorded on the media. It should be understood that in the
event that the data from one of the members becomes
unavailable, that data can be reconstructed if the parity is
consistent.

A write hole can occur when a system crashes or there is
a power loss with multiple writes outstanding to a device or
member disk drive. One write may have completed but not
all of them, resulting in inconsistent parity. Prior solutions
have been to reconstruct the parity for every block. Since
this is accomplished at the bit level, as may be imagined, this
can be very time consuming if billions of bits are involved.

SUMMARY OF THE INVENTION

In view of the above, it is a principal object of the present
invention to provide a technique, coupled with apparatus, for
inhibiting write hole formation while facilitating recovery of
write holes.

Another object of the present invention is to provide a
method of marking new data prior to its being written as well
as storing old data which permits localized reconstruction of
data without having to reconstruct the entire array of disks
when a crash or failure occurs.

Yet another object of the present invention is to provide
means for quickly identifying possible bad data blocks and
write holes and permitting the most expeditious recovery,
including possible reconstruction from precautionary
measures taken prior to the crash or failure of the system.

These and other objects are facilitated by employing a
nonvolatile write back cache and by storing the metadata
information in nonvolatile memory as well. Cached
metadata includes at least three fields. One field contains the LBN
(and thus the Logical Block Address [LBA]) of the material
to be written, another field contains the device ID, and the
third field contains the block status (dirty, clean, inconsistent
parity on slice, parity valid). [Hereinafter, crash = an
unintentional power or function cessation, which could be from
a controller, cache, memory drive, computer system etc.
unexpectedly ceasing operation due to power loss, failure
etc.] In the instance where a crash occurs during a "write"
to disk, it is possible the "write" wrote some, but not all, of
the data blocks and the parity block. This results in
inconsistent parity across the slice. Since a write-back cache
which is non-volatile is employed, the data that was to be
written is still retained, and by use of the metadata
information which is also saved in non-volatile memory, where
the write was intended is known. Thus the write that was
occurring when the crash took place can now be
reconstructed and it can be insured that the parity is consistent. In
this manner only the blocks affected by the crash may be
corrected, and parity for every block does not have to be
recalculated.
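
By way of illustration only, the recovery summarized above may be sketched in Python as follows. The sketch models one slice-wide row as a single byte per member, using the FIG. 1 values, and assumes a cache record holding the three metadata fields plus the retained data. The names CacheEntry, slice_parity, rebuild_member and recover_write are hypothetical and are not taken from the disclosure; an actual controller would operate on whole blocks held in non-volatile memory.

```python
from dataclasses import dataclass
from functools import reduce
from operator import xor


@dataclass
class CacheEntry:
    """Hypothetical metadata record kept in the non-volatile cache."""
    lbn: int        # logical block number of the data being written
    device_id: int  # member disk the write was headed for (0-based here)
    status: str     # "dirty", "clean", "inconsistent parity", "parity valid"
    data: int       # the new data itself, retained until the write completes


def slice_parity(values):
    """Xor of all member values; the order of Xoring does not matter."""
    return reduce(xor, values, 0)


def rebuild_member(row, missing):
    """Rebuild one missing member as the Xor of all others, parity included."""
    return slice_parity(v for i, v in enumerate(row) if i != missing)


def recover_write(row, parity_index, entry):
    """Replay an interrupted write from the cache, then fix parity if needed."""
    row[entry.device_id] = entry.data
    data = [v for i, v in enumerate(row) if i != parity_index]
    if slice_parity(data) != row[parity_index]:
        row[parity_index] = slice_parity(data)
    entry.status = "parity valid"


# FIG. 1 values on disks 1-4, with parity on disk 5.
row = [150, 112, 230, 139, 0]
row[4] = slice_parity(row[:4])

# A crash lands the new data for disk 2 but not the matching parity update.
entry = CacheEntry(lbn=4, device_id=1, status="dirty", data=200)
row[1] = 200
recover_write(row, parity_index=4, entry=entry)
```

Only the row named by the cache entry is touched in this sketch; no other slice is read or has its parity recalculated.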

Other objects and a more complete understanding of the 
invention may be had by referring to the following descrip- 
tion taken in conjunction with the accompanying drawings 
in which: 

BRIEF DESCRIPTION OF THE DRAWING(S) 

FIG. 1 is a table illustrating how parity is determined in 
a five disk drive array for a RAID set; 

FIG. 2 is a table showing a five disk drive array for a
RAID level 5 architectured RAID set, with striping of data
and parity, and including the hidden or metadata blocks of
information for both devices and the RAID set;

FIG. 3 is a block diagram illustrating a preferred redun- 
dant controller and cache arrangement which may be 
employed in accordance with the present invention; 

FIG. 4 is a diagrammatic representation of sample data 
mapping for a RAID level 5 array, illustrating two block 
high data striping with interleaved parity and left symmetric 
parity rotation, and which also shows the relationship 
between the "array management software", the virtual disk 
and the member disks of the array; 

FIG. 5 is a table depicting how parity may change with the 
writing of new data into an array member; 

FIG. 6 is a flow chart illustrating the steps employed for 
a Read, Modify and Write (RMW) Algorithm using the 
novel techniques of the present invention to inhibit the 
formation of write holes; 

FIG. 7 is a table illustrating how either technique, RMW
or reconstruct write (RW) algorithms, may be employed in
accordance with the invention; and

FIG. 8 is a flow diagram of the RW algorithm modified 
with the teachings of the present invention. 

DESCRIPTION OF THE ILLUSTRATIVE 
EMBODIMENT 

Referring now to the drawings and especially FIG. 2, a
typical five member RAID Array 10 is depicted in the table,
each member disk drive being labeled #1-#5 respectively
across the top of the table. As was discussed above with
reference to FIG. 1, a slice extends, at the bit level, across
members or disk drives of the Array. In a RAID 5 device, the
data is placed on the disks in blocks, for example each of 512
bytes, and given a Logical Block Number (LBN). For
example, as shown in the table of FIG. 2, block 0 is on drive #1,
block 4 on drive #2, block 8 on drive #3, block 12 on drive #4,
and parity is in an unnumbered block on drive #5. (Note that each
block that contained the error code information [parity]
would be numbered with a SCSI [Small Computer System
Interface] number, all blocks in a single drive being
numbered consecutively, but for purposes of discussion and
simplification have not been given an assignment number
herein. Thus the numbered blocks, i.e. those numbered in the
table and labeled LBN, represent data.) A single "slice"
would be a single bit of each of blocks 0, 4, 8, 12 and a parity
bit. Accordingly there would be 512 bytes times 8 bits/byte,
i.e. 4096, slices of bit size data. A "chunk" is defined as the
number of blocks placed on one drive before the next LBN block of
data is written on the next drive. In the illustrated instance
in FIG. 2, a chunk includes 4 blocks and a "strip", while
capable of being defined a number of ways, is shown as
being 4 blocks in depth times 5 drives or really 16 blocks of
data plus 4 blocks of parity information. (In the example or
sample shown in FIG. 4, a chunk is equal to two blocks; a
strip is equal to 8 data blocks and two parity blocks.) Thus
strip 0 in FIG. 2 includes user data blocks, or LBN's, of
0-15 plus four parity blocks, while strip 1 includes LBN's of
16-31 plus 4 parity blocks, and strip 2 includes LBN's of
32-47 with four blocks of parity, and so on. As is shown in
the abbreviated table of FIG. 2, the RAID is called a "left
symmetric rotate" because the four high (sometimes referred
to as "deep") parity blocks move around from drive #5 in
strip 0 to drive #4 in strip 1 to drive #3 in strip 2 and so
on. Notice also that the LBN continues its numbering
sequence in the same spiral fashion, LBN 16-19 being
written in strip 1 to drive #5, and continuing through drive
#1 (LBN 20-23), through LBN 31 appearing in drive #3.
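
The left symmetric mapping just described can be sketched in Python as follows. This is a minimal illustration, assuming the FIG. 2 geometry (five members, a chunk of 4 blocks) as defaults; the function name locate_lbn and its return convention are hypothetical and are not part of the disclosure.

```python
def locate_lbn(lbn, n_drives=5, chunk=4):
    """Map an LBN to (data drive, parity drive, block row on the drive).

    Drives are numbered from 1, as in FIG. 2. Parity starts on the last
    drive in strip 0 and moves one drive to the left in each later strip;
    data chunks start on the drive just after the parity drive and wrap.
    """
    data_blocks_per_strip = chunk * (n_drives - 1)
    strip, within_strip = divmod(lbn, data_blocks_per_strip)
    chunk_index, offset = divmod(within_strip, chunk)

    parity_drive = (n_drives - 1 - strip) % n_drives           # 0-based
    data_drive = (parity_drive + 1 + chunk_index) % n_drives   # 0-based
    row = strip * chunk + offset
    return data_drive + 1, parity_drive + 1, row


# Strip 0: LBN 0 on drive #1, LBN 12 on drive #4, parity on drive #5.
print(locate_lbn(0), locate_lbn(12))
# Strip 1: LBN 16 on drive #5, LBN 20 on drive #1, LBN 31 on drive #3,
# with parity on drive #4.
print(locate_lbn(16), locate_lbn(20), locate_lbn(31))
```

Calling locate_lbn with chunk=2 gives the two block high layout of the FIG. 4 sample, in which a strip holds 8 data blocks and two parity blocks.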

Every disk drive carries certain extra information on it to
permit identification of the disk drive, as well as other
pertinent information. Each drive, at the lower extent of the
table in FIG. 2, includes device specific information,
including an ID block, an FE block, and an FE-Dir block. Note that
the total lengths of the blocks under each drive are
representative of the size or capacity of each drive. For example,
disk drive #1 is the smallest drive, disk drives #2-#4 are the
same size, and disk drive #5 is the largest capacity drive.
Because drives are interchangeable, and a large drive can be
used in an array with smaller capacity drives, a substitute or
spare disk drive can be installed in the RAIDset as long as
the substitute drive has at least as much capacity as the
smallest drive in the RAIDset at the time of its creation. The
rest of the spaces are not used by the user, and are considered
and labeled "Fill Space**".
Turning now to the device specific information carried by
each of the drives, and as illustrated in FIG. 2, the lowest
boxes labeled "ID" and associated with each of Drives
#1-#5 representatively contain such information as RAID
membership information (order, serial number of all
members, EDC on ID information to protect metadata
format). This information is contained in data blocks on the
disk and is readable by the disk controller. This is included
as part of the device or member metadata illustrated in FIG.
2. Part of the device or member metadata is a forced error
box labeled FE, one such box being associated with each
disk drive #1-#5. The FE box represents one or more blocks
of data. Within the FE blocks is a single bit per device data
block, i.e. LBN and parity. In other words, each data block
on each of the drives has an FE bit. The single bit represents
whether the associated block can be reliably used in Xor
calculations. If the bit is 'set' (e.g. a "1"), as will be
explained hereinafter, the data block is considered bad as to
its reliability, and therefore the block, or any of its bits,
cannot be used for Xor calculations. There are enough
blocks of FE bits to represent all blocks in the device,
including the "Fill Space" of larger devices, but not the
metadata itself. The third and last box includes,
especially with larger drives, several blocks of data written
on each disk or device and is labelled "Forced Error Directory
Blocks", or FE-Dir. Each FE-Dir block contains 1 bit for
every block (512 bytes in our example) of forced error bits.
This block is used for a quick lookup of suspect data or
parity blocks. If a bit is not set, then there are no forced
errors in the Logical Block Address (LBA) range the FE
block covers. For faster lookup, the FE-Dir information is
cached in the controller cache (described hereinafter).
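
The two-level lookup described above (one FE bit per device block, and one FE-Dir bit per 512-byte block of FE bits) can be sketched in Python as follows. This is an in-memory illustration under stated assumptions only: the class name ForcedErrorMap, its method names, and the choice to clear a directory bit once its FE block holds no set bits are hypothetical and are not taken from the disclosure.

```python
BLOCK_BYTES = 512
BITS_PER_FE_BLOCK = BLOCK_BYTES * 8  # one FE block covers 4096 device blocks


class ForcedErrorMap:
    """Hypothetical model of the FE blocks and their FE-Dir summary bits."""

    def __init__(self, n_blocks):
        n_fe_blocks = -(-n_blocks // BITS_PER_FE_BLOCK)  # ceiling division
        self.fe_blocks = [bytearray(BLOCK_BYTES) for _ in range(n_fe_blocks)]
        self.fe_dir = bytearray(-(-n_fe_blocks // 8))    # 1 bit per FE block

    @staticmethod
    def _locate(block):
        fe_index, bit = divmod(block, BITS_PER_FE_BLOCK)
        return fe_index, bit // 8, 1 << (bit % 8)

    def set(self, block):
        """Mark a block as unusable for Xor calculations."""
        fe_index, byte, mask = self._locate(block)
        self.fe_blocks[fe_index][byte] |= mask
        self.fe_dir[fe_index // 8] |= 1 << (fe_index % 8)

    def clear(self, block):
        """Clear a block's FE bit; drop the directory bit if none remain."""
        fe_index, byte, mask = self._locate(block)
        self.fe_blocks[fe_index][byte] &= ~mask & 0xFF
        if not any(self.fe_blocks[fe_index]):
            self.fe_dir[fe_index // 8] &= ~(1 << (fe_index % 8)) & 0xFF

    def is_suspect(self, block):
        """Quick lookup: consult the directory bit before the FE block."""
        fe_index, byte, mask = self._locate(block)
        if not self.fe_dir[fe_index // 8] & (1 << (fe_index % 8)):
            return False  # no forced errors anywhere in this FE block's range
        return bool(self.fe_blocks[fe_index][byte] & mask)


fe = ForcedErrorMap(10000)
fe.set(6)
fe.set(5000)
fe.clear(6)
```

The directory check lets a lookup on a clean range return without reading the FE block at all, which is the purpose the text gives for the FE-Dir.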
One other group of blocks which exists and is written to
the disks is those containing RAIDed metadata. These
blocks contain much the same information as in the
individual device metadata blocks of information, with the
exception that since they are RAIDed, parity is employed so
that information can also be recovered here, if a block of
data or parity as such is considered bad, or a drive is
removed etc. The ID box in the RAIDed Metadata
representatively contains RAIDset information (serial number,
size), and EDC on ID information to protect the metadata
format. The RAIDed forced error "FE" bits still employ 1 bit
per RAIDed user data block, representing whether a block is
suspect and its data cannot be relied upon. (This does not
mean it is unusable, as with the device specific metadata FE
block, only unreliable or suspect.) The Forced Error
Directory box, which representatively may contain multiple disk
blocks, contains 1 bit per disk block of forced error bits. Like
its device specific partner FE-Dir, it is used for quick
lookup of suspect blocks. Moreover, if there is any fill space
in the RAIDed Metadata strip, this fill will take up the extra
space to the stripe boundary. These blocks have been labeled
Fill Space* (note the single asterisk to denote RAIDed area
fill space).

Forced error promotion occurs when two blocks within a
strip have unrecoverable read errors or corresponding device
FE bits set. For example, suppose block 6 and the parity
block in strip 0 are unreadable. This means that there is no
longer redundancy for blocks 2, 10 and 14. In order to regain
redundancy, the following procedure may be employed:
Write the RAIDed FE bit corresponding to the user data
block lost in strip 0; (if there were more data blocks lost,
write 0 to all lost data blocks); Calculate parity, including the
zeroed block just written, and good data blocks 2, 10 and 14;
Write the parity block; and clear the parity block's device FE,
and the lost data block's device FE. The RAIDed FE bit set
denotes that block 6 has been lost. A subsequent write of
block 6 will write correct data with full redundancy and clear
the RAIDed FE bit. It is the time from when block 6 has
been lost to the point that it has again been written with new
good data when forced error promotion in RAID is quite useful.
FE promotion allows the remaining other 'good' blocks of
data to be provided with full RAID-5 data integrity even in
the face of multiple block errors existing within the strip of
a RAID set.

For other information on fast initialization and additional
material on metadata, and for a more complete
understanding of the RAIDed Metadata, patent application Ser. No.
08/542,827, filed on even date herewith, entitled "A RAID
ARRAY DATA STORAGE SYSTEM WITH STORAGE
DEVICE METADATA AND RAID SET METADATA" and
assigned to the assignee of this invention, is hereby
incorporated by reference.

Turning now to FIG. 3, a system block diagram is shown
including a dual-redundant controller configuration 20. The
configuration may contain StorageWorks™ components,
HS-series controllers manufactured by Digital Equipment
Corporation. As shown, each controller 21, 22 is connected
to its own cache 23 and 24, each with bidirectional lines 25,
26 and 27, 28. Each controller 21, 22 is connected in turn
through I/O lines 29 and 30 respectively to a host interface
35, which may include a backplane for power supply and
bus connections and other things as discussed below. One or
more host computers or CPU's 31 and 32 may be connected
to the host interface 35. The backplane includes
intercontroller communication, control lines between the controllers
and shared SCSI-2 device ports such as shown schematically
at 33 and 34. Since the two controllers share SCSI-2 device
ports, the design enables continued device availability if
either controller fails.

Although not shown in the drawing, the commercial unit
mentioned above includes in the backplane a direct
communication path between the two controllers by means of a
serial communication universal asynchronous receiver/
transmitter (UART) on each controller. The controllers use
this communication link to inform one another about
controller initialization status. In a dual-redundant
configuration, such as the configuration 20 shown in FIG. 3,
a controller that is initializing or reinitializing sends
information about the process to the other controller. Controllers
send keep alive messages to each other at timed intervals.
The cessation of communication by one controller causes a
"failover" to occur once the surviving controller has
disabled the other controller. In a dual-redundant configuration,
if one controller fails, all attached storage devices continue
to be served. This is called "failover". Failover occurs, as has
been previously mentioned, because the controllers in a
dual-redundant configuration share SCSI-2 device ports and
therefore access to all attached storage devices. If failover is
to be achieved, the surviving controller should not require
access to the failed controller. The two way failover
communication line 37 is depicted in FIG. 3.

StorageWorks™ controllers in a dual redundant
configuration have the same configuration information at all times.
When configuration information is entered into one
controller, that controller sends the new information to the other
controller. Each controller stores this information in a
controller resident nonvolatile memory. If one controller fails,
the surviving controller continues to serve the failed
controller's devices to host computers, thus obviating shared
memory access. The controller resolves any discrepancies
by using the newest information.

Firmware components within a controller can
communicate with the other controller to synchronize
special events between the hardware on both controllers. Some
examples of these special events are SCSI-2 bus resets,
cache state changes, and diagnostic tests.

Each controller can sense the presence or absence of its
cache to set up cache diagnostics and cache operations and
can sense the presence or absence of the other controller's
cache for dual-controller setup purposes.

The failover of a controller's cache occurs only if
write-back caching was in use before the controller failure was
detected. In this case, the surviving controller causes the
failed controller's cache to write its information into the
surviving controller's cache. After this is accomplished, the
cache is released and access to the devices involved is
permitted. The cache then awaits the failed controller's
return to the dual-redundant configuration through
reinitialization or replacement.

If portions of the controller buffer and cache memories
fail, the controller continues normal operation. Hardware
error correction in controller memory, coupled with
advanced diagnostic firmware, allows the controller to
survive dynamic and static memory failures. In fact, the
controller will continue to operate even if a cache module fails.
For more information concerning the design and architecture
of HS-series StorageWorks™ array controllers, see Vol. 6,
No. 4, Fall 1994 (published 1995) issue of the "Digital
Technical Journal", page 5 et seq.

Referring now to FIG. 4, shown is a diagrammatic
representation of sample data mapping for a RAID level 5 array
50, illustrating two block high data striping (as opposed to
four in FIG. 2) with interleaved parity and left symmetric
rotation, and which also shows the relationship between the
"array management software" 60, the virtual disk 65, and the
member disks 1-5 of the array 50. RAID level 5 in its pure
form was rejected because of its poor write performance for
small write operations. Ultimately chosen was RAID level 5
data mapping (i.e., data striping with interleaved parity, as
illustrated in FIGS. 2 & 4) coupled with dynamic update
algorithms and writeback caching via the redundant caches
23, 24 in the redundant controllers 20 (FIG. 3) to overcome
the small-write penalty.

The Array Management Software 60 manages several
disks and presents them to a host operating environment as
virtual disks with different cost, availability, and performance
characteristics than the underlying members. The
Software 60 may execute either in the disk subsystem or in
a host computer. Its principal functions are:

(1) to map the storage space available to applications onto
the array member disks in a way that achieves some
desired balance of cost, availability, and performance,
and

(2) to present the storage to the operating environment as
one or more virtual disks by transparently converting
I/O requests directed to a virtual disk to I/O operations
on the underlying member disks, and by performing
whatever operations are required to provide any
extraordinary data availability capabilities offered by
the array 50.

Parity RAID array 50 appears to hosts, such as the host
computers 31 and 32, as an economical, fault-tolerant virtual
disk unit such as the virtual disk 65, where the blocks of data
appear consecutively (by their LBN) on the drive as shown
in FIG. 4. A Parity RAID virtual disk unit with a storage
capacity equivalent to that of n disks requires n+1 physical
disks to implement. Data and parity are distributed (striped)
across all disk members in the array, primarily to equalize
the overhead associated with processing concurrent small
write requests.

As has previously been explained with respect to FIG. 1,
if a disk in a Parity RAID array fails, its data can be
recovered by reading the corresponding blocks on the surviving
disk members and performing a reconstruct, data
reconstruct or reconstruct read (using exclusive-OR [Xor]
operations on data from other members).

Recalling that a slice is said to be consistent when parity
is the "exclusive or" (Xor) of all other bits in that slice, there
are three principal algorithms to be considered for modifying
data and parity and reconstructing data and parity when
it is desired to modify the same or there is a failure causing
the RAID set or a portion thereof to fail. When information
is to be written to a disk drive or particular LBN, (1) a Read,
Modify and Write algorithm (RMW) is employed when
parity is consistent and data is being written for some or a
small subset of the members, (2) a Reconstruct Write (RW)
algorithm is employed when most or all of the members are
being written, and (3) Non-Redundant Write (NRW) is
utilized when the member or drive containing the
parity data is missing.

Start with a consistent slice (i.e. parity is correct), and
suppose it is desired to write new data into a block. That means a new
parity for each affected slice must be constructed. Assume
that the RMW algorithm is going to be employed. The
algorithm is the Xor of the old data with the new data, and
the result Xor'd with the old parity to create the new parity.

By way of example, and referring now to FIG. 5, suppose
LBN 25 in FIG. 2 was to be modified. Suppose the bit
written in slice 0 of the old data block (LBN 25) is 0, but the
first bit in the modified slice in data block LBN 25 is 1. An
Xor of the old bit (0) and the new bit (1) gives 1. Xoring the
result (1) with the old parity of 0 changes the parity for the
entire slice from 0 to 1. Of course if the new bit state and the
old bit state are the same, no change will occur in the parity.
Thus there is no necessity of examining (reading) any of the
other bits of the other LBN's in slice 0 to determine and
write the new parity. Knowing only the state of the old data,
the new data and the old parity, one can determine the new
parity. The new data and the new parity may be then written
to the disk.

The problem of a write hole occurs when writing new or
modified data to an old block and/or writing a new parity and
a crash or power outage occurs, and one write is successful
and the other doesn't take place. This is when the
parity is said to be "inconsistent". The problem in the
system is that when a crash occurs in the middle of a write,
if there was no recording of what was being done, the
standard or conventional manner of proceeding is to examine
or reparity the entire disk and really the entire RAID set.
The purpose of the parity is to be able to reconstruct the data
on a disk when the disk crashes. So if parity is inconsistent,
data cannot be properly reconstructed. If it is not known
which data can be relied upon or if there is trash in one or
more LBN's associated with the inconsistent parity, then the
material may be impossible, without more, to recover. It
should also be noted that the occurrence of a write hole when
a RAIDset is reduced (drive missing and not replaced) can
lead to the loss of data from the missing member. This is
because the missing data is represented as the Xor of the data
from the remaining members and in no other way. If one
write succeeds and one does not, this Xor will produce
invalid data. Thus the present invention not only allows us
to make parity consistent quickly, reducing the probability of
losing data, but prevents loss of data due to write holes while
a RAIDset is reduced.

To eliminate this write hole, designers had to develop a
method of preserving information about ongoing RAID
write operations across power failures such that it could be
conveyed between partner controllers in a dual-redundant
configuration. Non-volatile caching of RAID write operations
in progress was the manner determined to be of most
use to alleviate the problem not only in dual-redundant
configurations, but in single controller operations.

So one of the problems is to determine where, if anywhere, the parity
is inconsistent, and the other problem is that if the system crash
causes a disk drive to fail, or if the array is reduced (i.e.
one drive missing), it is essential to know that the parity is
consistent so that the data of the reduced or failed drive may be
reconstructed merely by Xoring the remaining
bits for each remaining member in each slice. In a striped
RAID 5, such as our example in FIG. 2 (or in FIG. 4), the
loss of a drive would of necessity include data and parity.
For example, if drive 5 in the table of FIG. 2 fails, the parity
for strip 0 and data for strips 1 and 2 will be lost. However,
the lost data for each slice may be recreated or reconstructed
by simply Xoring the remaining LBN's. Another way to
look at a write hole is that if all of the writes succeed before
a failure, and as long as this is known, then parity will be
known to be consistent. The same is true if none of the writes
commanded are made before the RAID set is interrupted by
a power outage, loss of a drive etc.; the parity will then be
consistent. In contrast to those instances, a write hole problem
exists when it is not known whether parity is consistent (that
is, whether all writes succeeded or no writes were issued). A write
hole problem thus occurs when updating more than one
member (drive) and not all of the writes succeed, or insufficient
knowledge is obtainable so that it is not evident that
a write has or has not succeeded.

Double failures may also occur. Power failures that cause,
upon resumption of power, a loss of a drive, are considered
double failures. The drive's data, wherever the system was writing,
is missing and could be lost if it is not known whether the
writes were complete, partially complete etc.

In accordance with the invention, all writes and selected
"old information" are first recorded to non-volatile
memory, together with the status of those writes and old data, so
that what is known is not only the data, but where it is going,
and the state of the data. Given this information, parity can
be made consistent or data recovered even when a drive was
missing prior to or lost during a power failure that occurred
while RAID writes were in progress.

To this end, and referring now to FIG. 6 which illustrates
an improved flowchart for an RMW (Read, Modify and
Write), the steps include additional information to that disclosed
above with reference to FIG. 5. The steps are as
follows, when a write command is issued:

1. Read old data (OD) into NV cache (23 or 24).

2. Read old (existing) parity (OP) into the NV cache (23
or 24).

3. "Mark" new data (ND) as in the "Write Hole" (WH)
(record) state.
(As will be discussed hereinafter, "mark" means write to a cell which
corresponds to a block cache metadata entry).

4. "Mark" the old data (OD) as in the "old data" state.

5. Write the new data (ND).

6. Mark the old parity as in the "Old Parity" (OP) state.

7. Write new parity (NP).

8. When both writes succeed,
"Mark" at least one of the three blocks as "Parity valid" (PV),
i.e. either the "old data" (OD), the "new data" (ND) or the
old parity (OP). This is done in case an interrupt (failure)
occurs, in which case, without more information, the controller
would think a write hole had occurred and some of the data
had been lost, when in fact it has not been lost. The reason
for the original marking is so that the new parity can always
be found from Xoring the old data with the new data and
then Xoring the result with the old parity, as was shown
in the example of FIG. 5.

9. Null out "marks" for "other" blocks.

10. Null out "mark" for blocks containing the parity valid
(PV).

The cache 23 and 24 includes a large amount of memory
which may conveniently be divided into blocks, e.g. 512
bytes per block. Starting at the beginning of the cache, no
matter what address it is, consider every 512 bytes a line
chunk of memory, as being indexed in an array (the blocks
arranged indexed in an array), which may be numbered
consecutively for the number of blocks in the cache. The
index is used to allow indexing into a table to record
information about the contents of the block to which the
index corresponds. Conversely, the table can be scanned to
locate blocks in certain states which require error recovery
in the event of a system failure (i.e., crash).

In this connection, by setting aside 8 bytes of data for each
block on the disk for recording a "cache metadata entry",
which is indexed by block address converted to a unique
integer, the state of any block in the cache may be determined.
With a contiguous array of blocks, each having 512
bytes, this number is derived by simply taking the block
address, subtracting the base of the array, and shifting it right by 9 bits
(same as dividing by 512); each block will then have a unique
integer associated therewith.

In the cache metadata entry, the following is preserved:

(1) The logical block address (LBA) on the disk, which is
the address necessary for the drive controller to fetch or
write the data.

(2) The device index, which indexes another table, which
uniquely identifies the device. Now where the block is
and where the data resides or should reside on the
media is known.

(3) State of the data:
0=unused (free to write).
1=writeback: means that data needs to be written to
the disk at some time in the future.
2=write hole: indicates that this block is dirty data that
needs to be written to a member of a RAID set (either
3 or 5), and that we are in the process of writing it to a
RAID set which has parity (RAID sets 0 and 1 don't
have parity). This means that the parity of the particular
slice being written may be inconsistent
if the write hole is set.
3=this is old data.
4=this is old parity. This is really equivalent to 3
because, given the LBA and the device index (see (1)
& (2) above), it is known which member and where
it is; given the chunk size, whether this is parity data
or not can be computed.
5=parity valid. Mark means we write to a cell which
corresponds to the block; 1 cell maps to 1 unique
block (cache metadata entry).

If a crash occurs, and the system reboots, this array is
examined by the controller to determine if there is anything
in the write hole state. The first operation that the controller
does is to locate all the parity valid (PV) marks. Then it nulls out
any metadata of other blocks that fall in the same slice on the
same RAID set (that is what is meant in step 9 above of nulling
out the "marks" for other blocks), and subsequently nulls
out the metadata marked parity valid.

The buffer indexes for the three blocks are also known. One
of them is picked and a "parity valid" is stored into it (in fact
the device index for the parity member, the LBA of the
parity block and a state which indicates "parity valid" is
stored). Then the other two may be nulled out, and finally the
parity valid indication may be nulled out. If a crash
occurs after this is done, then that slice will not be affected
by the crash unless one of the devices is wiped out and the
data has to be reconstructed. In this manner the metadata
information has been cleared of anything which would
indicate, upon reboot of the system, that there is a problem
to correct.

Suppose that a crash occurs before parity valid is set but
write hole has been set, i.e. everything is in cache memory
(23 or 24), and both writes are about to be issued. In this
case, the old data, the new data and old parity are all in cache
memory. All that is necessary at that stage is to recompute
the new parity by Xoring the old and new data and then
Xoring the result with the old parity. This would create the
new parity. The key then is to remember what has been done
and in what state the data is.

The write hole marking is part of the metadata. The write back
state must be changed to write hole, which tells us that
parity may be inconsistent and that new data destined for the
specified disk block is present and where in the cache that
block resides. The other pieces of data from the slice that are
present give the entire picture. What data is present and how
it is marked is a function of the algorithm being used.

Assume that both reads (of old data and old parity) are
issued essentially at the same time. If a crash occurs now,
except that the new data is marked as being dirty and will
eventually have to be written starting at the top of the
algorithm in FIG. 6, there is nothing more to do.

Suppose that the old data read finishes first. Mark the old
data as old data. If a crash occurs now, error recovery will
simply clear the old data in the cache leaving the old data
still on the media and this will be known because the new
data hasn't yet been marked as write hole (WH). Suppose
the new data has been marked WH and before marking the
old data a crash occurs. In this instance, a write will be
forced of the new data but because the old data cannot be
found it will be assumed that the old data is still on the media
(which it is). In this case, missing data may be reconstructed.
Suppose both the old and new data are marked and a write
is issued. If a crash occurs here the old parity is correct on
the media but missing from the cache but may be fetched
during recovery. Thus, in general, recovery action is dictated
by which blocks from the slice are present in the cache and
their states. The key to rewriting or insuring the proper data
is written to the proper place after a fault is that of knowing
what pieces of data are available, and marking those before an
operation is started that is destructive.
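
The state-driven recovery described above can be sketched in a few lines. This is an illustrative sketch only, not the controller firmware: the names `BlockState`, `MetadataEntry`, `cache_index` and `plan_recovery` are hypothetical, the state numbers follow the cache metadata states listed earlier, and the returned action strings are a simplified reading of the crash cases just discussed.

```python
from dataclasses import dataclass
from enum import IntEnum

BLOCK_SHIFT = 9  # 512-byte cache blocks, so dividing by 512 is a 9-bit shift


class BlockState(IntEnum):
    # Numbering follows the "state of the data" list in the text.
    UNUSED = 0
    WRITEBACK = 1
    WRITE_HOLE = 2
    OLD_DATA = 3
    OLD_PARITY = 4
    PARITY_VALID = 5


@dataclass
class MetadataEntry:
    lba: int           # logical block address on the member disk
    device: int        # device index of the member
    state: BlockState


def cache_index(block_address, cache_base):
    """Unique integer for a cache block: subtract the base, shift right 9."""
    return (block_address - cache_base) >> BLOCK_SHIFT


def plan_recovery(slice_entries):
    """Pick a (hypothetical) recovery action from one slice's cached states."""
    states = {entry.state for entry in slice_entries}
    if BlockState.PARITY_VALID in states:
        # Both writes completed; only the metadata cleanup was interrupted.
        return "null all marks for the slice"
    if BlockState.WRITE_HOLE not in states:
        # Nothing destructive was started; old data is still on the media.
        return "clear old-block marks"
    if BlockState.OLD_DATA not in states:
        # New data marked WH but old data not found in the cache.
        return "rerun write of new data"
    if BlockState.OLD_PARITY not in states:
        # Old parity is still correct on the media and can be read back.
        return "fetch old parity from media, then recompute parity"
    return "recompute parity from cache and reissue writes"
```

The checks are ordered so that the least destructive interpretation is tried first, mirroring the order in which the marks are set during the write.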

In RMW, and referring now to FIG. 7, take for example
that it is desired to write new data (N1) into LBN 0 in Disk
Drive #1 replacing O1, which was originally in LBN 0.
Further assume that Disk Drive #3 has failed or is missing
(either before or after a power failure) and that data O3 is
unavailable, i.e. it is not resident in cache. Further suppose
that the crash occurred just after issuing writes of the new
parity and new data N1. N1 will be marked as WH (write
hole). You will also have in cache O1 (old data) for LBN 0
as well as OP (old parity) for the previous parity, as read from
disks #1 and #5 respectively. Subsequent to a crash, error
recovery will recompute the new parity as the Xor of O1,
N1, and OP and write it out to the appropriate disks along
with new data N1. After these writes occur, the parity will be
consistent again and error recovery will go through PV steps
to clean up the metadata. It should be recognized that O3
will always be able to be computed unless another disk in the
RAID set is lost. Before the write hole recovery, O3 is the
Xor of O1, O2, O4 and OP. After write hole recovery, O3 is
the Xor of N1, O2, O4 and NP.
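
The Xor relationships in this example can be checked with a short sketch. The byte values below are arbitrary illustrations rather than data from the figures, and the helper name `xor_blocks` is hypothetical.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Bytewise Xor of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Arbitrary illustrative contents for one slice: four data members, one parity.
O1, O2, O3, O4 = b"\x11\x22", b"\x0f\xf0", b"\xaa\x55", b"\x3c\xc3"
OP = xor_blocks(O1, O2, O3, O4)   # consistent old parity
N1 = b"\x7e\x81"                  # new data replacing O1

# RMW: the new parity needs only old data, new data and old parity.
NP = xor_blocks(O1, N1, OP)

# O3 (on the missing drive) is recoverable before and after recovery.
assert xor_blocks(O1, O2, O4, OP) == O3
assert xor_blocks(N1, O2, O4, NP) == O3

# The write hole itself: new data paired with old parity gives invalid data.
assert xor_blocks(N1, O2, O4, OP) != O3
```

The last assertion shows why a half-completed update is dangerous while the RAID set is reduced: the missing member can only be expressed through the other members, so mixing new data with old parity silently corrupts it.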

One more condition: suppose a spare drive had been put
in for missing Drive #3 but before the write error recovery
takes place. O3 is computed as stated above, i.e. the Xor of
O1, O2, O4 and OP, and then written to the replacement
spare or substitute drive, and the corresponding FE bit is
cleared.

By way of further example, assume that Disk Drive #3
was present at the time of the power failure, but replaced
with a spare just after reboot (or failover) and all the FE bits
were set to one. This is how the write hole is recovered:
The data in the cache is labelled as N1, O1, and OP. The
data on the disk is O2, O3 and O4. The steps may be
considered as follows:

1. Determine the Metadata FE's for all members (O3 is
the only one set).

2. Read O2 and O4.

3. Xor O1, O2, O4, OP to give O3.

4. Write O3 to Disk Drive #3.

5. Clear FE bit for O3.

6. Xor O1, N1, OP to give NP.

7. Redundantly mark O1 as O1, N1 as WH, and OP as
OP.

8. Write N1 to Disk Drive #1 and NP to Disk Drive #5.

9. Do PV steps of marking at least one of O1, OP or N1
as parity valid (PV).

10. Null out marks for "other blocks" not marked in 9
above; and

11. Null out mark for block selected as PV in step 9.
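
The ordering in the steps above matters: the replaced member is rebuilt from the old data and old parity before the interrupted writes are completed. A minimal sketch, assuming a simple model in which `cache` holds the blocks labelled N1, O1 and OP, `disk` maps drive numbers to blocks, data for this slice lives on drives 1 to 4 and parity on drive 5; the function and variable names are hypothetical and the metadata marking steps (7 and 9 to 11) are omitted.

```python
def xor(*blocks):
    """Bytewise Xor of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def recover_with_spare(cache, disk, fe_bits):
    """Rebuild flagged members, then finish the interrupted RMW write."""
    # Steps 1-5: for each member whose FE bit is set, rebuild its block from
    # the OLD data and OLD parity in cache plus the surviving data members.
    for drive in [d for d, is_set in fe_bits.items() if is_set]:
        survivors = [disk[d] for d in (2, 3, 4) if d != drive]
        disk[drive] = xor(cache["O1"], cache["OP"], *survivors)
        fe_bits[drive] = False
    # Step 6: new parity from old data, new data and old parity.
    new_parity = xor(cache["O1"], cache["N1"], cache["OP"])
    # Step 8: complete the writes (data to drive 1, parity to drive 5).
    disk[1] = cache["N1"]
    disk[5] = new_parity
    return new_parity
```

For example, with a blank spare in position 3 and unknown contents on drives 1 and 5 after the crash, calling `recover_with_spare` leaves the slice consistent and the spare holding the original O3.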
In a reconstruct write (RW) the situation is that of
determining which data blocks are present (to be written) for
each data block high slice and taking the following steps:

1. Read into cache all non-parity blocks (i.e. data blocks)
of members for which there is no new data to write.
When complete:

2. Mark all new data blocks with "WH", and all old data
blocks with "Old data". When a block is marked with WH
(write hole), this is saying (if a crash occurs) that a write was in
process; the data on the disk is unknown and
must be treated as suspect.

3. Compute new parity as Xor of all new data blocks and
any old data blocks read.

4. Issue writes of new data.

5. Issue write of new parity.

6. When all writes complete, mark one block (old data,
new data) "PV" (parity valid). Reason: if a failure occurs
at this point in time, when the system comes back up,
it will scan the array for PV's. If it sees one, it will then
null out anything in the same slice because it knows
that what was written was complete.

7. Mark "other" blocks null. (Setting to 0 state, unused,
free etc.)

8. Mark the selected block null (step 6).
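
The reconstruct write steps can be sketched as follows. This is a simplified model under stated assumptions, not the controller's implementation: `disk` and `marks` are plain dictionaries standing in for the media and the cache metadata, the mark strings are placeholders, and all names are hypothetical.

```python
from functools import reduce


def xor_all(blocks):
    """Bytewise Xor of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_write(disk, parity_member, new_data, marks):
    """Apply the RW steps to one slice.

    disk:          dict member -> block currently on the media
    parity_member: key of the member holding parity for this slice
    new_data:      dict member -> new block to be written
    marks:         dict member -> mark, standing in for cache metadata
    """
    # Step 1: read the data blocks of members that are not being written.
    old = {m: b for m, b in disk.items()
           if m != parity_member and m not in new_data}
    # Step 2: mark everything before any destructive operation starts.
    for member in new_data:
        marks[member] = "WH"
    for member in old:
        marks[member] = "OLD"
    # Step 3: new parity is the Xor of the new data and the old data read.
    new_parity = xor_all(list(new_data.values()) + list(old.values()))
    # Steps 4-5: issue the data and parity writes.
    disk.update(new_data)
    disk[parity_member] = new_parity
    # Step 6: mark one block parity valid.
    chosen = next(iter(new_data))
    marks[chosen] = "PV"
    # Step 7: null the other marks; step 8: null the selected block's mark.
    for member in list(marks):
        if member != chosen:
            del marks[member]
    del marks[chosen]
    return new_parity
```

Unlike RMW, the old parity is never read here, which is why this path suits writes that touch most or all of the members of a slice.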

For recovery purposes, the important consideration is that 
before any writes are done, the old data is marked as well as 
the new data. This protects against loss of a drive since all 
of the data except parity is available after a crash. Thus, 
double failures are handled for this algorithm as well. 

StorageWorks™ controller firmware developers have
automated Parity RAID management features rather than 
require manual intervention to inhibit write hole problems 
after failures. Controller-based automatic array management 
is superior to manual techniques because the controller has 
the best visibility into array problems and can best manage 
any situation given proper guidelines for operation. 

Thus the present invention provides a novel method and 
apparatus for reconstructing data in a computer system 
employing a parity RAID data protection scheme. By 
employing a write back cache composed of non-volatile 
memory for storing (1) writes outstanding to a device, (2) 
selected "old data", and (3) storing metadata information in 
the non-volatile memory, it may be determined if, where and 
when the write was intended or did happen when the crash 
occurred. An examination is made to determine whether 
parity is consistent across the slice, and if not, the data in the 
non-volatile write back cache is used to reconstruct the write 
that was occurring when the crash occurred to insure con- 
sistent parity, so that only those blocks affected by the crash 
have to be reconstructed. 

Although the invention has been described with a certain 
degree of particularity, it should be recognized that elements 
thereof may be altered by person(s) skilled in the art without
departing from the spirit and scope of the invention as 
hereinafter set forth in the following claims. 

What is claimed is: 

1. A method of reconstructing data in a computer system 
employing a Parity RAID protection scheme for a striped 
array of storage devices that employ parity recovery in the 
event of a crash, said computer system including a write 
back cache composed of non-volatile memory for storing (1) 
write data outstanding that is to be written to storage 
devices, and (2) metadata information; said metadata information
comprising a first field containing an LBA of said
write data outstanding, a second field containing device IDs 
that correspond to said write data outstanding, and a third 
field containing status that indicates consistent or inconsis- 
tent write slice parity, comprising the steps of: 

storing old data in said non-volatile memory from storage 
devices that are intended for said write data 
outstanding, to protect said old data in the event a crash
occurs during a write to a storage device;

storing old parity that corresponds to said old data in said
non-volatile memory;

determining from said metadata information where a
given write data outstanding was intended when a crash
occurs;

determining whether parity is consistent across a write
slice corresponding to said given write data
outstanding, and if parity is not consistent, using said
old data stored in said non-volatile memory and said
old parity stored in said non-volatile memory to
reconstruct said given write data outstanding to thereby
insure consistent parity, whereby;

only slices of said given write data outstanding whose
parity is not consistent and are affected by the crash
have to be reconstructed.

2. A method in accordance with claim 1 including the step
of:

storing said write data outstanding in said non-volatile
memory; and

marking said stored write data outstanding with indicia
indicating that said stored write data outstanding is in
a write hole state.

3. A method in accordance with claim 2, including the step
of:

marking said old data stored in said non-volatile memory
as being in an old data state.

4. A method in accordance with claim 3, including the step
of:

writing said write data outstanding to a designated storage
device.

5. A method in accordance with claim 4, wherein said step
of writing said write data outstanding to a designated storage
device occurs subsequent to said step of storing said write
data outstanding in said non-volatile memory.

6. A method in accordance with claim 5, including the
steps of:

calculating a new parity for said write data outstanding;
and

writing said new parity to a storage device.

7. A method in accordance with claim 6 including the step
of:

subsequent to writing said write data outstanding and said
new parity to a storage device, marking at least one of
said old data, said write data outstanding or said old
parity as parity valid.

8. A method in accordance with claim 7, including the step
of:

subsequent to said step of marking as parity valid, nulling
out marks for other blocks that were not marked as
parity valid by the step of claim 7.

9. A method in accordance with claim 8, including the step
of:

subsequent to the step set forth in claim 8, nulling out said
parity valid mark of claim 7.

10. For use in connection with a computer system, a
method of reconstructing data in a Parity RAID protection
scheme for attachment to said computer system, a write back
cache composed of non-volatile memory for storing (1)
write data that is outstanding to a storage device and
associated old data read from a storage device, and (2)
metadata information; said metadata information comprising
a first field containing a LBA of said outstanding write
data, a second field containing a storage device ID corresponding
to said outstanding write data, and a third field
containing a parity consistent/inconsistent status of data in
said non-volatile memory, comprising the steps of:

determining from said metadata information a storage
device for which said outstanding write data is
intended;

storing in said non-volatile memory associated old data
from at least areas on storage devices intended for
outstanding write data, to protect said associated old
data in the event a crash occurs during a write of said
outstanding write data to a storage device;

determining whether parity is consistent across a data
slice that includes said LBA, and if not using data
stored in said non-volatile memory to reconstruct a
write of said outstanding write data that was in progress
when a crash occurred to insure consistent parity,
whereby;

only data slices that have inconsistent parity and that are
affected by a crash have to be reconstructed.

11. A method in accordance with claim 10 including the
steps of:

marking said write data that is outstanding to a storage
device with indicia indicating that said write data that
is outstanding to a storage device is in a write hole state.

12. A method in accordance with claim 11 including the
step of:

marking said associated old data storing in said non-volatile
memory as being in an old data state.

13. A method in accordance with claim 12, including the
step of:

writing said write data that is outstanding to a storage
device to a designated storage device.

14. A method in accordance with claim 13, including the
steps of:

calculating a new parity for said write data that is outstanding
to a storage device; and

writing said new parity to a storage device.

15. A method in accordance with claim 14 including the
step of:

subsequent to said step of writing said write data that is
outstanding to a storage device to a designated storage
device, and said step of writing said new parity to a
storage device, marking at least one of said write data
that is outstanding to a storage device or said new data
as being parity valid.

16. A method in accordance with claim 15, including the
step of:

subsequent to said parity valid marking step as set forth in
claim 15, marking null blocks that are not marked
parity valid.

17. A method in accordance with claim 16, including the
step of:

subsequent to said marking null step as set forth in claim
16, nulling said parity valid marks.

18. In a computer system employing a Parity RAID
protection scheme, and an array set of disk drives connected
to said computer system, said disk drives containing data
written thereon in a predetermined and prescribed format of
data blocks:

a controller having a write back cache composed of
non-volatile memory, said non-volatile memory being
disposed intermediate said array set of disk drives and
said computer system;

said non-volatile memory storing (1) data block writes
outstanding to a disk drive, (2) associated data read




from a disk drive, and (3) metadata information con- 
cerning information contained on said disk drives; 

said metadata information comprising a first field containing
an LBA of the associated data read, a second
field containing a disk drive ID, and a third field
containing a block status of data blocks contained on
said disk drives;

a program contained within said controller for determining
from said metadata information where a write is
intended prior to a write being made to a disk drive of
said array set; and

means in said controller for determining whether parity is
consistent across a slice of multiple data blocks written
on a disk drive, and if not using data stored in said
non-volatile memory to reconstruct a write to insure
consistent parity, whereby reconstruction of data slices
is necessary only for those slices on said set of disk
drives for which inconsistent parity is determined.

19. In combination:

a computer having a central processor;

an array of disk drives comprising computer-readable disk
memory having a surface formed with a plurality of
binary patterns constituting an application program that
is executable by said computer;

a controller connected intermediate said array of disk
drives and said central processor;

said disk drive array in conjunction with said controller
employing a Parity RAID protection scheme;




a non-volatile cache contained within said controller;

said controller effecting reading and writing of data
blocks from and to said array of disk drives;

said application program including instructions for a
method of reconstructing data slices in said disk drive
array;

said application program including instructions for read- 
ing into said non-volatile cache (1) writes outstanding 
to a disk drive, (2) associated data read, and (3) 
metadata information concerning said disk drive array; 

said metadata information comprising a first field con- 
taining LBAs of data blocks, a second field containing 
disk drive IDs, and a third field containing data block 
status; 

said application program including instructions for deter- 
mining from said metadata information to which disk 
drive a write is intended; 

said application program including instruction for deter- 
mining whether parity is consistent across a data block 
slice, and if not using data in said non-volatile cache to 
construct or to reconstruct write data, when necessary, 
to insure consistent parity, whereby only data block 
slices affected by inconsistent parity will have to be
constructed or reconstructed. 

***** 